Examples:

Traditional method: ordinal regression / ordered probit regression

Bayesian GLM approach:

23.1 Modelling Ordinal Data with an Underlying Metric Variable

23.2 The Case of a Single Group

23.2.1 Implementaion in JAGS

"
model {
  for (i in 1:Ntotal) {
    y[i] ~ dcat(pr[i,1:nYlevels])
    pr[i,1] <- pnorm(thresh[1], mu, 1/sigma^2)
    for (k in 2:(nYlevels-1)) {
      pr[i,k] <- max(0, pnorm(thresh[k], mu, 1/sigma^2) - pnorm(thresh[k-1], mu, 1/sigma^2))
    }
    pr[i,nYlevels] <- 1 - pnorm(thresh[nYlevels-1], mu, 1/sigma^2)
  }
}
"

Notes:

  • cumulative normal in JAGS is called pnorm
  • max(0,…) needed because thresholds randomly generated by MCMC chain so possible to get nonsensical result without max

Priors BB: again note uniform prior over log range; I think people should be more careful about this…:

"
mu ~ dnorm((1_nYlevels)/2, 1/(nYlevels)^2)
sigma ~ dunif(nYlevels/1000, nYlevels*10)
for (k in 2:(nYlevels-2)) {
  thresh[k] ~ dnorm(k+0.5, 1/2^2)
}
"

23.2.2 Examples: Bayesian estimation recovers true parameter values

Figure 23.2 and 23.3. Goals is to show that Bayesian estimation can accurately recover our parameter values even when \(\mu\) is extreme and the thresholds are not evenly spaced. Figure 23.2 data generated using \(\mu=1,\ \sigma=2.5,\ \theta_1=1.5,\ \theta_2=2.5,\ \theta_3=3.5,\ \theta_4=4.5,\ \theta_5=5.5,\ \theta_6=6.5\).

Remember parameter values need to be jointly credible: if we have a higher \(\theta_2\) threshold and want to maintain the likelihood of outcome 3 then we need to also increase \(\theta_3\). Causes a domino effect through all the thresholds.

23.2.2.1 Not the same results as pretending the data are metric

If instead we treat the data as though it were metric and normally distributed we get very different results:

  • Mean of 1.95 instead of 0.966 (data generated using mean of 1.0).
  • Sometimes the mean is ok but the predictions of individual outcome probabilities are terrible

BB: is it really that this special way of treating ordinal data is good and treating it as metric is bad, or is it really that when they were assuming the data was metric they assumed it was normally distributed when it clearly wasn’t, how would other metric data models compare?

23.2.2.2 Ordinal outcomes versus Likert scales

Likert scales are from 1=strongly disagree to 5=strongly agree. Remember for this chapter we are interested in describing the ordinal outcomes themselves, not the arithmetic average of several ordinal responses. If there are many related items on a questionnaire then the model of the ordinal data can use latent factors to express relations among the items.

23.3 The Case of Two Groups

Example: ask about Happiness for two groups - one after watching sports car ads, one after watching charity ads. Assume that both groups access the same underlying metric scale and thresholds, what differs is the mean and variance of the feeling.

23.3.1 Implementation in JAGS

Simple extention of the specification for one group.

23.3.2 Examples: Not funny

Again showing that comparison of means and std deviations of likert values for two groups are much better made through this Bayesian method than a traditional method that assumes the data is normal.

23.4 The Case of Metric Predictors

Figure 23.6 for important explanation of how we do linear regression to predict the underlying metric variable, then we apply our same model to convert underlying metric variable to ordinal predicted variable. Ordinal probit regression.

Figures 23.7 to 23.10 interesting examples.

23.4.1 Implementation in JAGS

Combination of linear regression models seen previously and the model we just illustrated above in this chapter. Predictors were standardized to improve MCMC efficiency, as we have also seen before. Predicted values are just categories and don’t need to be standardized so it is more similar to logistic regression in a way.

23.4.2 Examples: Happiness and money

Data fall mostly in the middle of the response scale (Fig 23.7 and Fig 23.8) Underlying regression is linear but it should asymptote so this is not a good model of the underlying value: there should be diminishing returns.

23.4.3 Example: Movies - They don’t make ’em like they used to

Have threshold lines that are level contours of the underlying metric planar surface (Fig 23.9)

However, the region the threshold lines bound as reserved for categories 7 and 1 don’t have any datapoints in.

23.4.4 Why are some thresholds outside the data?

Compare to artificial data (Fig 23.10) as an example; has many more data points that span the entire range.

Fig 23.11 is good for showing outcomes for noisy data. Y axis is probability of each outcome (different line for each outcome). Most probable outcome within interval is not necessarily the outcome centered on that interval

23.5 Posterior Prediction

How are posterior predicted probabilities with 95% HDI computed? At each step in MCMC chain compute predicted outcome probabilities using equations given and explained. From the full chain of credible posterior predicted outcome probabilities the central tendency and 95% HDI are then computed. Brief example how to do this below: use the MCMC chain coda object returned from JAGS (codaSample) then do the processing of it in R.

mcmcMat = as.matrix(codaSamples)
chainLength = nrow(mcmcMat)
beta0 = mcmcMat[,"beta0"]
beta1 = mcmcMat[,"beta"]
thresh = mcmcMat[,grep("^thresh",colnames(mcmcMat))]
sigma = mcmcMat[,"sigma"]

outProb = matrix(NA,nrow=chainLength,ncol=max)
for (stepIdx in 1:chainLength) {
  mu = beta0[stepIdx] + beta1[stepIdx] * xProbe
  threshCumProb = pnorm((thres[stepIdx, ] - mu)/sigma[stepIdx])
  outProb[stepIdx, ] = c(threshCumProb,1) - c(0,threshCumProb)
}
outMed = apply(outProb, 2, median, na.rm=True)
outHdi = apply(outProb, 2, HDIofMCMC)

23.6 Generalizations and Extensions

Approaches to outliers:

Variable selection: just as predictors in linear regression or logistic regression can be given inclusions parameters, so can our predictors in thresholded cumulative-normal regression.

Nominal predictors: works just fine, look back to Figure 21.12 to see how this might work. The only change is putting thresholds on the normal noise distribution to create probabilities for the ordinal outcomes.

23.7 Exercises

Exercise 23.2 [Purpose: Modifying the program to ]

setwd("./DBDA2Eprograms")
source("DBDA2E-utilities.R") # Load definitions of graphics functions etc.
source("Jags-Yord-XmetMulti-Mnormal-Guessing-Example.R")
setwd("./DBDA2Eprograms")
source("DBDA2E-utilities.R") # Load definitions of graphics functions etc.
source("Jags-Yord-XmetMulti-Mnormal-tDist-Example.R")

A

i.) Posterior distribution of the guessing parameter (alpha). Recall guessing parameter is \(\frac{1}{7}=0.1429\); this gets multiplied by alphas itself to get the probability added into each category.

Interesting (and expected) sigma is definitely reduced compared to not using a guessing parameter.

ii.) Are the regression coefficients a little more extreme? Yes. Why?

iii.) Is there anything unusual about the posterior distribution on the thresholds (see Figure 23.12), and why? Hint: even if the thresholds are randomly inverted in the MCMC chain, and the outcome probabilities from that component are zero, the random component still gives the outcomes a nonzero probability. Thresholds can get inverted; thresholds are generated randomly by the MCMC chain so sometimes it’s possible they become the wrong way around.

Normally (without random guessing mixture) when this happens we model that the probability of any data occuring in the category between the thresholds is 0. Therefore when we use the data to work out the likelihood and find there is some data in that category we get the likelihood of that to be 0, which indicates that parameters when the thresholds are the wrong way around are not credible, and this will be reflected in the posterior as well. Now the way MCMC Gibbs sampling works is that it spends time at particular parameters proportional to their posterior probabilities; if the posterior probabilities are 0 then those parameter combinations won’t be explored and won’t show up in the figure.

When we include the guessing mixture then the model now has some nonzero probability for data to be in any category no matter what the thresholds are, so suddenly the MCMC chain is now able to explore all these values.

B

Now change model to a thresholded cumulative-t-distribution model (with no random guessing mixture)

i.) Are the regression coefficients a little more extreme? Seem a little less extreme, sigma is somewhere between part A and the original example where the modal value was about 1.25.

ii.) Is there anything unusual about the posterior distribution on the thresholds (see Figure 23.12), and why? Thresholds no longer get inverted; simply using a slightly different distribution to the normal and nothing more fundamental about the model changes as we had before.